Note: This is a modified version of the notebook published by CamDavidsonPilon@github.

Several of the visualizations were converted into interactive views using Exhibitionist. Where available, you can hit the "Freeze" button to convert the view into a static image for saving with the notebook. Double-clicking the static image converts it back into a dynamic view.

You can install the Exhibitionist library using:

pip install git+https://github.com/Exhibitionist/Exhibitionist.git

If you experience issues, please update your installation to the latest git master. If they persist, please report them on the GitHub issues page.


In [1]:
import exhibitionist as xb # make sure xb is available
import xbviews.app as app
views = app.get_views()
views

In [1]:
figsize(12.5, 4)  # figsize, np and plt are provided by the notebook's pylab inline mode
import scipy.stats as stats

Chapter 1


The Philosophy of Bayesian Inference

You are a skilled programmer, but bugs still slip into your code. After a particularly difficult implementation of an algorithm, you decide to test your code on a trivial example. It passes. You test the code on a harder problem. It passes once again. And it passes the next, even more difficult, test too! You are starting to believe that there may be no bugs present...

If you think this way, then congratulations, you already are a Bayesian practitioner! Bayesian inference is simply updating your beliefs after considering new evidence. A Bayesian can rarely be certain about a result, but he or she can be very confident. Just like in the example above, we can never be 100% sure that our code is bug-free unless we test it on every possible problem; something rarely possible in practice. Instead, we can test it on a large number of problems, and if it succeeds we can feel more confident about our code. Bayesian inference works identically: we can only update our beliefs about an outcome, rarely can we be absolutely sure unless we rule out all other alternatives. We will see that being uncertain can have its advantages.

The Bayesian state of mind

Bayesian inference differs from more traditional statistical analysis by preserving uncertainty about our beliefs. At first, this sounds like a bad statistical technique. Isn't statistics all about deriving certainty from randomness? The Bayesian method interprets probability as a measure of believability in an event, that is, how confident one is in an event occurring. In fact, we will see in a moment that this is the natural interpretation of probability.

To make this clearer, we consider an alternative interpretation of probability: frequentist methods assume that probability is the long-run frequency of events (hence the bestowed title). For example, the probability of plane accidents under a frequentist philosophy is interpreted as the long-term frequency of plane accidents. This makes logical sense for many probabilities and events, but becomes more difficult to understand when events have no long-term frequency of occurrence. Consider: we often assign probabilities to outcomes of presidential elections, but the election itself only happens once! Frequentists get around this by invoking alternative realities and saying that across all these universes, the frequency of occurrences is the probability.

Bayesians, on the other hand, have a more intuitive approach. Bayesians interpret a probability as a measure of belief, or confidence, in an event occurring. An individual who assigns a belief of 0 to an event has no confidence that the event will occur; conversely, assigning a belief of 1 implies that the individual is absolutely certain of the event occurring. Beliefs between 0 and 1 allow for weightings of other outcomes. This definition agrees with the plane accident example, for having observed the frequency of plane accidents, an individual's belief should be equal to that frequency. Similarly, under this definition of probability being equal to beliefs, it is clear how we can speak about probabilities (beliefs) of presidential election outcomes.

Notice in the paragraph above, I assigned the belief (probability) measure to an individual, not to Nature. This is very interesting, as this definition leaves room for conflicting beliefs between individuals. Again, this is appropriate for what naturally occurs: different individuals have different beliefs about events occurring, because they possess different information about the world.

  • I flip a coin, and we both guess the result. We would both agree, assuming the coin is fair, that the probability of heads is 1/2. Assume, then, that I peek at the coin. Now I know for certain what the result is: I assign probability 1.0 to either heads or tails. Now what is your belief that the coin is heads? My knowledge of the outcome has not changed the coin's result. Thus we assign different probabilities to the result.

  • Your code either has a bug in it or it does not, but we do not know for certain which is true, though we do have a belief about the presence or absence of a bug.

  • A medical patient is exhibiting symptoms $x$, $y$ and $z$. There are a number of diseases that could be causing all of them, but only a single disease is present. A doctor has beliefs about which disease it is.

  • You believe that the beautiful girl in your English class doesn't have a crush on you. You assign a low probability that she does. She, on the other hand, knows for certain that she does indeed like you. She (implicitly) assigns a probability of 1.

This philosophy of treating beliefs as probabilities is natural to humans. We employ it constantly as we interact with the world and only see partial evidence. By contrast, you have to be trained to think like a frequentist.

To align ourselves with traditional probability notation, we denote our belief about event $A$ as $P(A)$.

John Maynard Keynes, a great economist and thinker, said "When the facts change, I change my mind. What do you do, sir?" This quote reflects the way a Bayesian updates his or her beliefs after seeing evidence. Even, and especially, if the evidence is counter to what was initially believed, it cannot be ignored. We denote our updated belief as $P(A|X)$, interpreted as the probability of $A$ given the evidence $X$. We call the updated belief the posterior probability so as to contrast it with the prior probability. For example, consider the posterior probabilities (read: posterior beliefs) of the above examples, after observing some evidence $X$:

1. $P(A): \;\;$ the coin has a 50 percent chance of being heads. $P(A | X):\;\;$ You look at the coin, observe a heads, denote this information $X$, and trivially assign probability 1.0 to heads and 0.0 to tails.

2. $P(A): \;\;$ This big, complex code likely has a bug in it. $P(A | X): \;\;$ The code passed all $X$ tests; there still might be a bug, but its presence is less likely now.

3. $P(A):\;\;$ The patient could have any number of diseases. $P(A | X):\;\;$ Performing a blood test generated evidence $X$, ruling out some of the possible diseases from consideration.

4. $P(A):\;\;$ You believe that the probability that the lovely girl in your class likes you is low. $P(A | X): \;\;$ She sent you an SMS message about this Friday night. Interesting...

It's clear that in each example we did not completely discard the prior belief after seeing new evidence, but we re-weighted the prior to incorporate the new evidence (i.e. we put more weight, or confidence, on some beliefs versus others).

By introducing prior uncertainty about events, we are already admitting that any guess we make is potentially very wrong. After observing data, evidence, or other information, we update our beliefs, and our guess becomes less wrong. This is the alternative side of the prediction coin, where typically we try to be more right.

Bayesian Inference in Practice

If frequentist and Bayesian inference were computer programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, whereas the Bayesian function would return a distribution.

For example, in our debugging problem above, calling the frequentist function with the argument "My code passed all $X$ tests; is my code bug-free?" would return a YES. On the other hand, asking our Bayesian function "Often my code has bugs. My code passed all $X$ tests; is my code bug-free?" would return something very different: a distribution over YES and NO. The function might return

YES, with probability 0.8; NO, with probability 0.2

This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument: "Often my code has bugs". This parameter is the prior. By including the prior parameter, we are telling the Bayesian function to include our personal belief about the situation. Technically this parameter in the Bayesian function is optional, but we will see excluding it has its own consequences.
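
To make the analogy concrete, here is a toy sketch of the two "functions." It is not a real inference library: the function names are illustrative, and the likelihood $P(X | \text{bugs present}) = 0.5$ is an assumption borrowed from the worked example later in this chapter.

In [ ]:
def frequentist_debugger(tests_passed):
    # the frequentist function returns a single answer
    return "YES" if tests_passed else "NO"

def bayesian_debugger(prior_bug_free, tests_passed):
    # the Bayesian function returns a distribution over answers, computed
    # with Bayes' Theorem, assuming buggy code still passes the tests
    # with probability 0.5
    if not tests_passed:
        return {"YES": 0.0, "NO": 1.0}
    p = prior_bug_free
    posterior = p / (p + 0.5 * (1 - p))
    return {"YES": posterior, "NO": 1 - posterior}

print(frequentist_debugger(True))    # YES
print(bayesian_debugger(0.2, True))  # roughly {'YES': 0.33, 'NO': 0.67}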

As we acquire more and more instances of evidence, our prior belief is washed out by the new evidence. This is to be expected. For example, if your prior belief is something ridiculous, like "I expect the sun to explode today", and each day you are proved wrong, you would hope that any inference would correct you, or at least align your beliefs better.

Denote by $N$ the number of instances of evidence we possess. As we gather an infinite amount of evidence, say as $N \rightarrow \infty$, our Bayesian results align with frequentist results. Hence for large $N$, statistical inference is more or less objective. On the other hand, for small $N$, inference is much more unstable: frequentist estimates have more variance and larger confidence intervals. This is where Bayesian analysis excels. By introducing a prior, and returning a distribution (instead of a scalar estimate), we preserve the uncertainty to reflect the instability of statistical inference on a small-$N$ dataset.

One may think that for large $N$, one can be indifferent between the two techniques, and might lean towards the computationally simpler frequentist methods. An analyst in this position should consider the following quote by Andrew Gelman (2005)[1] before making such a decision:

Sample sizes are never large. If $N$ is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once $N$ is "large enough," you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc etc). $N$ is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.

A note on Big Data

Paradoxically, big data's predictive analytic problems are actually solved by relatively simple models [2][4]. Thus we can argue that big data's prediction difficulty does not lie in the algorithm used, but in the computational difficulties of storage and execution on big data. (One should also consider Gelman's quote from above and ask "Do I really have big data?")

The much more difficult analytic problems involve medium data and, especially troublesome, really small data. Using a similar argument as Gelman's above, if big data problems are big enough to be readily solved, then we should be more interested in the not-quite-big-enough datasets.

Our Bayesian framework

We are interested in beliefs, which can be interpreted as probabilities by thinking Bayesian. We have a prior belief in event $A$: our belief before looking at any evidence, e.g., our prior belief about bugs being in our code before performing tests.

Secondly, we observe our evidence. To continue our example, if our code passes $X$ tests, we want to update our belief to incorporate this. We call this new belief the posterior probability. Updating our belief is done via the following equation, known as Bayes' Theorem, after Thomas Bayes:

\begin{align} P( A | X ) = & \frac{ P(X | A) P(A) } {P(X) } \\\\[5pt] & \propto P(X | A) P(A)\;\; (\propto \text{is proportional to } ) \end{align}

The above formula is not unique to Bayesian inference: it is a mathematical fact with uses outside Bayesian inference. Bayesian inference merely uses it to connect $P(A)$ with an updated $P(A | X )$.

Example: Bug, or just sweet, unintended feature?

Let $A$ denote the event that our code has no bugs in it. Let $X$ denote the event that the code passes all debugging tests. For now, we will leave the prior probability of no bugs as a variable, i.e. $P(A) = p$.

We are interested in $P(A|X)$, i.e., the probability of no bugs given our debugging tests $X$. To use the formula above, we need to compute some of the quantities appearing in it.

What is $P(X | A)$, i.e., the probability that the code passes $X$ tests given there are no bugs? Well, it is equal to 1, for code with no bugs will pass all tests.

$P(X)$ is a little bit trickier: the event $X$ can be divided into two possibilities: event $X$ occurring even though our code indeed has bugs (denoted $\sim A\;$, spoken "not $A$"), or event $X$ occurring without bugs ($A$). $P(X)$ can be represented as:

\begin{align} P(X ) & = P(X \text{ and } A) + P(X \text{ and } \sim A) \\\\[5pt] & = P(X|A)P(A) + P(X | \sim A)P(\sim A)\\\\[5pt] & = P(X|A)p + P(X | \sim A)(1-p) \end{align}

We have already computed $P(X|A)$ above. On the other hand, $P(X | \sim A)$ is subjective: our code can pass tests but still have a bug in it, though the probability there is a bug present is less. Note this is dependent on the number of tests performed, the degree of complication in the tests, etc. Let's be conservative and assign $P(X|\sim A) = 0.5$. Then

\begin{align} P(A | X) & = \frac{1\cdot p}{ 1\cdot p +0.5 (1-p) } \\\\ & = \frac{ 2 p}{1+p} \end{align}

This is the posterior probability distribution. What does it look like as a function of our prior, $p \in [0,1]$?


In [3]:
figsize( 9, 4 )
p = np.linspace( 0,1, 50)
plt.plot( p, 2*p/(1+p), color = "#348ABD" )
plt.fill_between( p, 2*p/(1+p), alpha = .4, facecolor = ["#348ABD"])
plt.scatter( 0.2, 2*(0.2)/1.2, s = 140, c ="#348ABD"  )
plt.xlim( 0, 1)
plt.ylim( 0, 1)
plt.xlabel( "Prior, $P(A) = p$")
plt.ylabel("Posterior, $P(A|X)$, with $P(A) = p$")
plt.title( "Are there bugs in my code?");


We can see that the biggest gains from observing the $X$ passed tests occur when the prior probability, $p$, is low. Let's settle on a specific value for the prior. I'm a (I think) strong programmer, so I'm going to give myself a realistic prior of 0.20, that is, there is a 20% chance that I write code bug-free. To be more realistic, this prior should be a function of how complicated and large the code is, but let's pin it at 0.20. Then my updated belief that my code is bug-free is 0.33.

Recall that the prior is a probability distribution: $p$ is the prior probability that there are no bugs, so $1-p$ is the prior probability that there are bugs.

Similarly, our posterior is also a probability distribution, with $P(A | X)$ the probability there is no bug given we saw all tests pass, hence $1-P(A|X)$ is the probability there is a bug given all tests passed. What does our posterior probability distribution look like? Below is a graph of both the prior and the posterior distributions.


In [4]:
figsize( 9, 4 )
prior = [0.20, 0.80]
posterior =  [1./3, 2./3]
plt.bar( [0,.7], prior ,alpha = 0.80, width = 0.25, \
            color = "#348ABD", label = "prior distribution" )
plt.bar( [0+0.25,.7+0.25], posterior ,alpha = 0.7, \
          width = 0.25, color =  "#A60628", label = "posterior distribution" )

plt.xticks( [0.20,.95], ["Bugs Absent", "Bugs Present"] )
plt.title("Prior and Posterior probability of bugs present, prior = 0.2")
plt.ylabel("Probability")
plt.legend(loc="upper left");


Notice that after we observed $X$ occur, the probability of bugs being absent increased. By increasing the number of tests, we can approach confidence (probability 1) that there are no bugs present.
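
As a quick sketch of that convergence (under the assumptions of this example: the tests are independent, and buggy code passes each one with probability 0.5), we can compute the posterior after $N$ passing tests:

In [ ]:
# posterior probability of "no bugs" after N independent passing tests,
# assuming P(pass | bugs) = 0.5 per test, as in the example above
import numpy as np

p = 0.20                               # prior probability of bug-free code
n_tests = np.arange(1, 11)             # number of passing tests
posterior = p / (p + 0.5 ** n_tests * (1 - p))
for n, post in zip(n_tests, posterior):
    print("%2d tests passed -> P(no bugs) = %.3f" % (n, post))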

This was a very simple example of Bayesian inference and Bayes' rule. Unfortunately, the mathematics necessary to perform more complicated Bayesian inference only becomes more difficult, except for artificially constructed cases. We will later see that this type of mathematical analysis is actually unnecessary. First we must broaden our modeling tools.


Probability Distributions

Let's quickly recall what a probability distribution is: Let $Z$ be some random variable. Then associated with $Z$ is a probability distribution function that assigns probabilities to the different outcomes $Z$ can take. There are three cases:

  • $Z$ is discrete: Discrete random variables may only assume values on a specified list. Things like populations, movie ratings, and number of votes are all discrete random variables. This becomes clearer when we contrast it with...

  • $Z$ is continuous: Continuous random variables can take on arbitrarily exact values. For example, temperature, speed, time, and color are all modeled as continuous variables because you can constantly make the values more and more precise.

  • $Z$ is mixed: Mixed random variables assign probabilities to both discrete and continuous outcomes, i.e., they are a combination of the above two categories.

Discrete Case

If $Z$ is discrete, then its distribution is called a probability mass function, which measures the probability $Z$ takes on the value $k$, denoted $P(Z=k)$. Note that the probability mass function completely describes the random variable $Z$, that is, if we know the mass function, we know how $Z$ should behave. There are popular probability mass functions that consistently appear: we will introduce them as needed, but let's introduce the first very useful probability mass function. We say $Z$ is Poisson-distributed if:

$$P(Z = k) =\frac{ \lambda^k e^{-\lambda} }{k!}, \; \; k=0,1,2, \dots $$

What is $\lambda$? It is called the parameter, and it describes the shape of the distribution. For the Poisson random variable, $\lambda$ can be any positive number. By increasing $\lambda$, we add more probability to larger values, and conversely by decreasing $\lambda$ we add more probability to smaller values. Unlike $\lambda$, which can be any positive number, $k$ must be a non-negative integer, i.e., $k$ must take on values 0,1,2, and so on. This is very important, because if you wanted to model a population you could not make sense of populations with 4.25 or 5.612 members.

If a random variable $Z$ has a Poisson mass distribution, we denote this by writing

$$Z \sim \text{Poi}(\lambda) $$

One very useful property of the Poisson random variable, given we know $\lambda$, is that its expected value is equal to the parameter, i.e.:

$$E\large[ \;Z\; | \; \lambda \;\large] = \lambda $$

We will use this property often, so it's something useful to remember. Below we plot the probability mass function for different $\lambda$ values. The first thing to notice is that by increasing $\lambda$ we add more probability to larger values occurring. Secondly, notice that although the graph ends at 15, the distributions do not. They assign positive probability to every non-negative integer.
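
First, though, a quick numerical sanity check of the expected-value identity above: a sketch in which the choice of lam and the sample size are arbitrary, and the sample average of many simulated draws should be close, but not exactly equal, to $\lambda$.

In [ ]:
# empirical check that E[Z | lambda] = lambda for the Poisson:
# the mean of many simulated draws should be close to lambda
import scipy.stats as stats

lam = 4.25
samples = stats.poisson.rvs(lam, size=100000)
print("sample mean: %.3f  (lambda = %.2f)" % (samples.mean(), lam))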


In [5]:
figsize( 12.5, 4)
import scipy.stats as stats
a = np.arange( 16 )
poi = stats.poisson
lambda_ = [1.5, 4.25 ]
colours = ["#348ABD", "#A60628"]

plt.bar( a, poi.pmf( a, lambda_[0]), color=colours[0],
        label = "$\lambda = %.1f$"%lambda_[0], alpha = 0.95)
plt.bar( a, poi.pmf( a, lambda_[1]), color=colours[1],
         label = "$\lambda = %.1f$"%lambda_[1], alpha = 0.60)

plt.xticks( a + 0.4, a )
plt.legend()
plt.ylabel("probability of $k$")
plt.xlabel("$k$")
plt.title("Probability mass function of a Poisson random variable; differing \
$\lambda$ values");



In [8]:
views['PoissonDistribution'] # show interactive view

Continuous Case

Instead of a probability mass function, a continuous random variable has a probability density function. This might seem like unnecessary nomenclature, but the density function and the mass function are very different creatures. An example of a continuous random variable is one with an exponential density. The density function for an exponential random variable looks like:

$$f_Z(z | \lambda) = \lambda e^{-\lambda z }, \;\; z\ge 0$$

Like the Poisson random variable, an exponential random variable can only take on non-negative values. But unlike a Poisson random variable, the exponential can take on any non-negative value, like 4.25 or 5.612401. This makes it a poor choice for count data, which must be integers, but a great choice for time data, or temperature data (measured in kelvins, of course), or any other precise and positive variable. Below are two probability density functions with different $\lambda$ values.

When a random variable $Z$ has an exponential distribution with parameter $\lambda$, we say $Z$ is exponential and write

$$Z \sim \text{Exp}(\lambda)$$

Given a specific $\lambda$, the expected value of an exponential random variable is equal to the inverse of $\lambda$, that is:

$$E[\; Z \;|\; \lambda \;] = \frac{1}{\lambda}$$
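
The same kind of sanity check works for this identity; again a sketch with arbitrary numbers. Note that scipy parameterizes the exponential by its scale, which is $1/\lambda$, as the plotting code below also does.

In [ ]:
# empirical check that E[Z | lambda] = 1/lambda for the exponential;
# scipy's expon takes a scale parameter equal to 1/lambda
import scipy.stats as stats

lam = 0.5
samples = stats.expon.rvs(scale=1. / lam, size=100000)
print("sample mean: %.3f  (1/lambda = %.1f)" % (samples.mean(), 1. / lam))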

In [6]:
a = np.linspace(0,4, 100)
expo = stats.expon
lambda_ = [0.5, 1]
colours = [ "#A60628", "#348ABD"]
for l,c in zip(lambda_,colours):
    plt.plot( a, expo.pdf( a, scale=1./l), lw=3, 
                color=c, label = "$\lambda = %.1f$"%l)
    plt.fill_between( a, expo.pdf( a, scale=1./l), color=c, alpha = .33)
    
plt.legend()
plt.ylabel("PDF at $z$")
plt.xlabel("$z$")
plt.title("Probability density function of an Exponential random variable;\
 differing $\lambda$");


You can experiment with different values of the $\lambda$ parameter by executing the cell below. Hit "Freeze" to store a static version in the notebook.


In [6]:
views['ExponentialDistribution'] # show interactive view

But what is $\lambda \;\;$?

This question is what motivates statistics. In the real world, $\lambda$ is hidden from us. We only see $Z$, and must go backwards to try and determine $\lambda$. The problem is so difficult because there is not a one-to-one mapping from $Z$ to $\lambda$. Many different methods have been created to solve the problem of estimating $\lambda$, but since $\lambda$ is never actually observed, no one can say for certain which method is better!

Bayesian inference is concerned with beliefs about what $\lambda$ is. Rather than try to guess $\lambda$ exactly, we can only talk about what $\lambda$ is likely to be by assigning a probability distribution to $\lambda$.

This might seem odd at first: after all, $\lambda$ is fixed, it is not (necessarily) random! How can we assign probabilities to a non-random event? Ah, we have fallen for the frequentist interpretation. Recall, under our Bayesian philosophy, we can assign probabilities if we interpret them as beliefs. And it is entirely acceptable to have beliefs about the parameter $\lambda$.
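
As a small illustration of why there is no one-to-one mapping (a sketch with arbitrary numbers), note that the same observed count is assigned positive probability by many different $\lambda$ values:

In [ ]:
# the same observation Z = 5 is plausible under many different lambdas,
# so Z alone cannot pin lambda down; we can only form beliefs about it
import scipy.stats as stats

z_observed = 5
for lam in [2.0, 5.0, 8.0]:
    print("P(Z = %d | lambda = %.1f) = %.4f"
          % (z_observed, lam, stats.poisson.pmf(z_observed, lam)))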

Example: Inferring behaviour from text-message data

Let's try to model a more interesting example, concerning text-message rates:

You are given a series of text-message counts from a user of your system. The data, plotted over time, appears in the graph below. You are curious if the user's text-messaging habits changed over time, either gradually or suddenly. How can you model this? (This is in fact my own text-message data. Judge my popularity as you wish.)


In [7]:
figsize( 12, 3.5 )
count_data = np.loadtxt("data/txtdata.csv")
n_count_data = len(count_data)
plt.bar( np.arange( n_count_data ), count_data, color ="#348ABD"  )
plt.xlabel( "Time (days)")
plt.ylabel("# Text-msg received")
plt.title("Did the user's texting habits change over time?")
plt.xlim( 0, n_count_data );


Before we begin: with respect to the plot above, would you say there was a change in behaviour during this time period?

How can we start to model this? Well, as I conveniently already introduced, a Poisson random variable would be a very appropriate model for this count data. Denoting day $i$'s text-message count by $C_i$,

$$ C_i \sim \text{Poisson}(\lambda) $$

We are not sure what the $\lambda$ parameter is, though. Looking at the chart above, it appears that the rate might become higher at some later date, which is equivalent to saying that the parameter $\lambda$ increases at some later date (recall that a higher $\lambda$ means more probability on larger outcomes, that is, a higher probability of many texts).

How can we represent this mathematically? We can think that at some later date (call it $\tau$), the parameter $\lambda$ suddenly jumps to a higher value. So we create two $\lambda$ parameters: one for behaviour before $\tau$, and one for behaviour after. In the literature, a sudden transition like this is called a switchpoint:

$$ \lambda = \begin{cases} \lambda_1 & \text{if } t \lt \tau \cr \lambda_2 & \text{if } t \ge \tau \end{cases} $$

If, in reality, no sudden change occurred and indeed $\lambda_1 = \lambda_2$, the $\lambda$'s posterior distributions should look about equal.

What would be good prior distributions for $\lambda_1$ and $\lambda_2$? Recall that $\lambda_i, \; i=1,2,$ can be any positive number. The exponential random variable assigns density to any positive number, so it is a good choice for modelling $\lambda_i$. But again, we need a parameter for this exponential distribution: call it $\alpha$.

\begin{align} &\lambda_1 \sim \text{Exp}( \alpha ) \\\ &\lambda_2 \sim \text{Exp}( \alpha ) \end{align}

$\alpha$ is called a hyper-parameter, or a parent variable, literally a parameter that influences other parameters. The influence is not too strong, so we can choose $\alpha$ liberally. A good rule of thumb is to set the exponential parameter equal to the inverse of the average of the count data; since we are modeling $\lambda$ using an exponential distribution, we can use the expected-value identity shown earlier to get:

$$\frac{1}{N}\sum_{i=0}^{N-1} C_i \approx E[\; \lambda \; |\; \alpha ] = \frac{1}{\alpha}$$

Alternatively, and something I encourage the reader to try, is to have two priors, one for each $\lambda_i$: creating two exponential distributions with different $\alpha$ values reflects a prior belief that the rate changed after some period.

What about $\tau$? Well, due to the randomness, it is too difficult to pick out when $\tau$ might have occurred. Instead, we can assign a uniform prior belief to every possible day. This is equivalent to saying

\begin{align} & \tau \sim \text{DiscreteUniform(1,70) }\\\\ & \Rightarrow P( \tau = k ) = \frac{1}{70} \end{align}

So after all this, what does our prior distribution look like? Frankly, it doesn't matter. What we should understand is that it would be an ugly, complicated mess involving symbols only a mathematician would love. And things would only get uglier the more complicated our models become. Regardless, all we really care about is the posterior distribution anyway. We next turn to PyMC, a Python library for performing Bayesian analysis that is agnostic to the mathematical monster we have created.

Introducing our first hammer: PyMC

PyMC is a Python library for programming Bayesian analysis [3]. It is a fast, well-maintained library. The only unfortunate part is that the documentation can be lacking in areas, especially the bridge between problem and solution. One of this book's main goals is to solve that problem, and also to demonstrate why PyMC is so cool.

We will model the above problem using the PyMC library. This type of programming is called probabilistic programming, an unfortunate misnomer that invokes ideas of randomly-generated code and has likely confused and frightened users away from this field. The code is not random. The title is given because we create probability models using programming variables as the model's components. This will be the last time I use the term probabilistic programming. Instead, I'll simply use programming, as that is what it really is.

The PyMC code is easy to follow along with: the only novel thing should be the syntax, and I will interrupt the code to explain sections. Simply remember that we are representing the model's components ($\tau, \lambda_1, \lambda_2$) as variables:


In [8]:
import pymc as mc

n = count_data.shape[0]

alpha = 1.0/count_data.mean() #recall count_data is 
                              #the variable that holds our txt counts

lambda_1 = mc.Exponential( "lambda_1",  alpha )
lambda_2 = mc.Exponential( "lambda_2", alpha )

tau = mc.DiscreteUniform( "tau", lower = 0, upper = n )

In the above code, we create the PyMC variables corresponding to $\lambda_1$ and $\lambda_2$. We assign them to PyMC's stochastic variables, so called because they are treated by the backend as random number generators. We can test this by calling their built-in random() method.


In [9]:
print "Random output:", tau.random(),tau.random(), tau.random()


Random output: 64 59 36

In [10]:
@mc.deterministic
def lambda_( tau = tau, lambda_1 = lambda_1, lambda_2 = lambda_2 ):
    out = np.zeros( n )  
    out[:tau] = lambda_1 #lambda before tau is lambda1
    out[tau:] = lambda_2 #lambda after tau is lambda2
    return out

This code creates a new function lambda_, but really we should think of it as a random variable: the random variable $\lambda$ from above. Note that because lambda_1, lambda_2 and tau are random, lambda_ will be random. We are not fixing any variables yet. The @mc.deterministic decorator tells PyMC that this is a deterministic function, i.e., if the arguments were deterministic (which they are not), the output would be deterministic as well.
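
As a quick sanity check (a sketch relying on PyMC's .value attribute, which evaluates a variable at its parents' current values), we can re-draw tau and watch lambda_'s output change:

In [ ]:
# lambda_ is random because its parents are: re-drawing tau moves the
# switchpoint, which changes the array that lambda_ evaluates to
tau.random()              # draw (and set) a new value for tau
print(lambda_.value[:10]) # the first few days' lambda under the new tau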


In [11]:
observation = mc.Poisson( "obs", lambda_, value = count_data, observed = True)

model = mc.Model( [observation, lambda_1, lambda_2, tau] )

The variable observation combines our data, count_data, with our proposed data-generation scheme, given by the variable lambda_, through the value keyword. We also set observed = True to tell PyMC that this should stay fixed in our analysis. Finally, PyMC wants us to collect all the variables of interest and create a Model instance out of them. This makes our life easier when we try to retrieve the results.


In [12]:
### Mysterious code to be explained later.
mcmc = mc.MCMC(model)
mcmc.sample( 100000, 50000, 1 )


[****************100%******************]  100000 of 100000 complete

The above code will be explained in Chapter 3, but this is where our results come from. The machinery being employed is called Markov Chain Monte Carlo (which I delay explaining until Chapter 3). It returns thousands of random samples from the posterior distributions of $\lambda_1, \lambda_2$ and $\tau$. We can plot histograms of these samples to see what the posterior distributions look like. Below, we collect the samples (called traces in the MCMC literature) into histograms.


In [13]:
lambda_1_samples = mcmc.trace( 'lambda_1' )[:]
lambda_2_samples = mcmc.trace( 'lambda_2' )[:]
tau_samples = mcmc.trace( 'tau' )[:]

In [58]:
figsize(12.5, 10)
#histogram of the samples:

ax = plt.subplot(311)
ax.set_autoscaley_on(False)

plt.hist( lambda_1_samples, histtype='stepfilled', bins = 60, alpha = 0.85, 
        label = "posterior of $\lambda_1$", color = "#A60628",normed = True )
plt.legend(loc = "upper left")
plt.title(r"Posterior distributions of the variables $\lambda_1,\;\lambda_2,\;\tau$")
plt.xlim([15,30])
plt.xlabel("$\lambda$ value")
plt.ylabel("probability")

ax = plt.subplot(312)
ax.set_autoscaley_on(False)

plt.hist( lambda_2_samples,histtype='stepfilled', bins = 60, alpha = 0.85, 
          label = "posterior of $\lambda_2$",color="#7A68A6", normed = True )
plt.legend(loc = "upper left")
plt.xlim([15,30])
plt.xlabel("$\lambda$ value")
plt.ylabel("probability")

plt.subplot(313)


w = 1.0/ tau_samples.shape[0] * np.ones_like( tau_samples )
plt.hist( tau_samples, bins = n_count_data, alpha = 1, 
         label = r"posterior of $\tau$",
         color="#467821", weights=w, rwidth =1. )

plt.legend(loc = "upper left");
plt.ylim([0,.75])
plt.xlim([35, len(count_data)-20])
plt.xlabel("days")
plt.ylabel("probability");



In [3]:
views['MCMCView'] # show interactive view

Interpretation

Recall that the Bayesian methodology returns a distribution, hence we now have distributions to describe the unknown $\lambda$'s and $\tau$. What have we gained? Immediately we can see the uncertainty in our estimates: the more variance in a distribution, the less certain our posterior belief should be. We can also say what a plausible value for the parameters might be: $\lambda_1$ is around 18 and $\lambda_2$ is around 23. What other observations can you make? Look at the data again; do these seem reasonable? The distributions of the two $\lambda$s are positioned very differently, indicating that it's likely there was a change in the user's text-message behaviour.

Also notice that the posterior distributions do not look like any Poisson distributions, though we originally started modelling with Poisson random variables. They are really not anything we recognize. But this is OK. This is one of the benefits of taking a computational point of view. If we had instead done this mathematically, we would have been stuck with an analytically intractable (and messy) distribution. Via computation, we are agnostic to the tractability.

Our analysis also returned a distribution for what $\tau$ might be. Had no change occurred, or the change been gradual over time, the posterior distribution of $\tau$ would have been more spread out, reflecting that many values are likely candidates for $\tau$. On the contrary, it is very peaked. It appears that near day 45, the individual's text-message behavior suddenly changed.
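
One way to quantify how peaked the posterior of $\tau$ is (a sketch using the samples we already collected) is to look at how much posterior mass sits on the single most frequent day:

In [ ]:
# fraction of posterior samples on the most frequent switchpoint day;
# a very peaked posterior puts most of its mass on one or two days
import numpy as np

counts = np.bincount(tau_samples.astype(int))
mode_day = counts.argmax()
print("most probable switchpoint: day %d" % mode_day)
print("posterior mass on that day: %.2f"
      % (counts.max() / float(tau_samples.shape[0])))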

Why would I want samples from the posterior, anyways?

We will deal with this question for the remainder of the book, and it is an understatement to say we can perform amazingly useful things with posterior samples. For now, let's end this chapter with one more example. We'll use the posterior samples to answer the following question: what is the expected number of texts at day $t, \; 0 \le t \le 70$? Recall that the expected value of a Poisson random variable is equal to its parameter $\lambda$; the question is therefore equivalent to asking what is the expected value of $\lambda$ at time $t$?

In the code below, we are calculating the following: let $i$ index samples from the posterior distributions. Given a day $t$, we average over all possible $\lambda_i$ for that day $t$, using $\lambda_i = \lambda_{1,i}$ if $t \lt \tau_i$ (that is, if the behaviour change had not occurred yet), else we use $\lambda_i = \lambda_{2,i}$.



In [59]:
figsize( 12.5, 4)
# tau_samples, lambda_1_samples, lambda_2_samples contain
# N samples from the corresponding posterior distribution
N = tau_samples.shape[0]
expected_texts_per_day = np.zeros(n_count_data)
for  day in range(0, n_count_data):
    # ix is a bool index of all tau samples corresponding to
    # the switchpoint occuring prior to value of 'day'
    ix = day < tau_samples
    # Each posterior sample corresponds to a value for tau.
    # for each day, that value of tau indicates whether we're "before" (in the lambda1 "regime")
    # or "after" (in the lambda2 "regime") the switchpoint.
    # by taking the posterior sample of lambda1/2 accordingly, we can average
    # over all samples to get an expected value for lambda on that day.
    # As explained, the "message count" random variable is Poisson distributed, 
    # and therefore lambda (the poisson parameter) is the expected value of "message count"
    expected_texts_per_day[day] = (lambda_1_samples[ix].sum() 
                                    + lambda_2_samples[~ix].sum() ) /N

    
plt.plot( range( n_count_data), expected_texts_per_day, lw =4, color = "#E24A33" )
plt.xlim( 0, n_count_data )
plt.xlabel( "Day" )
plt.ylabel( "Expected # text-messages" )
plt.title( "Expected number of text-messages received")
#plt.ylim( 0, 35 )
plt.bar( np.arange( len(count_data) ), count_data, color ="#348ABD", alpha = 0.5,
            label="observed texts per day")

plt.legend(loc="upper left");


Our analysis shows strong support for believing the user's behavior did change ($\lambda_1$ would have been close in value to $\lambda_2$ had no change occurred), and that the change was sudden rather than gradual (demonstrated by $\tau$'s strongly peaked posterior distribution). We can speculate what might have caused this: a cheaper text-message rate, a recent weather-2-text subscription, or a new relationship. (The 45th day corresponds to Christmas, and I moved away to Toronto the next month, leaving a girlfriend behind.)

Exercises

1. Using lambda_1_samples and lambda_2_samples, what is the mean of the posterior distributions of $\lambda_1$ and $\lambda_2$?


In [17]:
#type your code here.

2. What is the expected percent increase in text-message rates? Hint: compute the mean of lambda_1_samples/lambda_2_samples. Note that this quantity is very different from lambda_1_samples.mean()/lambda_2_samples.mean().


In [16]:
#type your code here.

3. Looking at the posterior distribution graph of $\tau$, why do you think there is a small number of posterior $\tau$ samples near 0? Hint: look at the data again.

4. What is the mean of $\lambda_1$ given that we know $\tau$ is less than 45? That is, suppose we have new information telling us for certain that the change in behaviour occurred before day 45. What is the expected value of $\lambda_1$ now? (You do not need to redo the PyMC part; just consider all instances where tau_samples < 45.)


In [ ]:
#type your code here.

References

  • [1] Gelman, Andrew. N.p., Web. 22 Jan 2013. http://andrewgelman.com/2005/07/n_is_never_larg/.
  • [2] Norvig, Peter. 2009. The Unreasonable Effectiveness of Data.
  • [3] Patil, A., D. Huard and C.J. Fonnesbeck. 2010. PyMC: Bayesian Stochastic Modelling in Python. Journal of Statistical Software, 35(4), pp. 1-81.
  • [4] Lin, Jimmy, and Alek Kolcz. 2012. Large-Scale Machine Learning at Twitter. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD 2012), pages 793-804, May 2012, Scottsdale, Arizona.

In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

